RNAseq analysis of her3 knockout and morpholino zebrafish samples

Goals:

Samples:

Zebrafish fusion samples /gpfs0/home1/gdkendalllab/lab/raw_data/fastq/2016_01_01/ sample tumor fusion background A667T19 D738 VGLL2-NCOA2 P53-/- A667T20 D739 VGLL2-NCOA2 AB/TL A667T21 D742 VGLL2-NCOA2 P53-/- A667T22 D777 VGLL2-NCOA2 AB/TL A667T23 D799 VGLL2-NCOA2 AB/TL A667T24 D800 VGLL2-NCOA2 AB/TL A667T25 D801 VGLL2-NCOA2 AB/TL A667T26 D807 VGLL2-NCOA2 AB/TL A667T27 D808 VGLL2-NCOA2 AB/TL A667T28 D813-Head VGLL2-NCOA2 AB/TL A667T29 D813-Tail VGLL2-NCOA2 AB/TL D809 D809 VGLL2-NCOA2 AB/TL D811 D811 VGLL2-NCOA2 AB/TL D812-head D812-head VGLL2-NCOA2 AB/TL D812-tail D812-tail VGLL2-NCOA2 AB/TL D814 D814 VGLL2-NCOA2 AB/TL D815 D815 VGLL2-NCOA2 AB/TL D874 D874 VGLL2-NCOA2 AB/TL

Zebrafish P53/Kras mutant fish SRR6507311 mutant SRR6507312 mutant SRR6507313 mutant SRR6507300 whole fish SRR6507301 whole fish SRR6507302 whole fish

Zebrafish normal muscle /home/gdkendalllab/lab/raw_data/fastq/2016_01_02 A374-A375T23 A374-A375T24 D948 muscle D949 muscle D950 muscle D951 muscle D952 muscle

Zebrafish CICDUX4 tumors D655 D668 D669 D678 D681 D682 D684 D685 D686 D687 D688 D689

Mouse data from fusion in C2C12 cells /gpfs0/home1/gdkendalllab/lab/raw_data/fastq/2015_12_22/ A377-A378T1 - A377-A378T8 T7 and T8 are controls with C2C12 without fusion expression

Mouse normal skeletal muscle from SRA PRJNA625451 PRJNA608179 - Miiice in spaaaaace PRJNA819493 - 14, 36 weeks PRJNA813153

Human data from patient fusion tumors /gpfs0/home1/gdkendalllab/lab/raw_data/fastq/2016_12_14/ EGAR00001508614_SARC061 EGAR00001508618_SARC065 EGAR00001508623_SARC070-Primary EGAR00001508624_SARC070-Relapse1 EGAR00001508656_SARC102

Normal human skeletal muscle data from GTEx https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/gene_reads/gene_reads_2017-06-05_v8_muscle_skeletal.gct.gz

Approach:


Figure 2 of paper - moving to supplement and replace with comparison vs adult skeletal muscle ***


Allograft Figure 6 - Likely end up in supplemental data scatterplot of C2C12-VN2 vs adult muscle (maybe C2C12 control) expression vs VN2 human tumors vs adult muscle pathways co-enriched in both comparisons *** Kras vs our tumors PCA - include RNAseq despite P53 include adult muscle to check for batch effect heatmap scatterplot of Kras vs our tumors expression Fold change values do GSEA or pathway enrichment analysis on shared DE genes and uniquely DE genes


Level of fusion gene expression between mouse, zebrafish and human


do PCA/umap of Dr tumor samples


To analyze these samples, we aligned with HiSat2 and counted the number of reads mapping to each gene using featureCounts. We used EdgeR to test for differential expression between the groups. For most statistical analyses and plotting, we used R and ggplot2.

DE comparisons to make

Based on group names from sampleMetadata.xlsx

DrFusion vs DrSkMu DrKras vs DrKrasCtrl MmFusion vs MmSkMu HsFusion vs HsSkMu

AGDEX analysis

VGLL2NCOA2 human and fish tumors vs. mature skeletal muscle 

VGLL2NCOA2 human and fish tumors vs. KRAS driven RMS 

Mouse VGLL2NCOA2 and fish tumors vs. mature skeletal muscle 

Motif analysis (done)

VGLL2-NCOA2 expression in the mouse xenograft models

Determining reads that span the fusion breakpoint/levels of expression in the VN2 mouse allograft models. 

27 developmental gene signature

Are these genes expressed in mouse VGLL2-NCOA2 tumors compared to C2C12 controls? 

Are these genes expressed in patient VGLL2-NCOA2 tumors compared to mature skeletal muscle? 

Chip-Seq analysis from zebrafish tumors and C2C12 (as data is generated)

Tead motifs upstream of 27 developmental genes

iRDB from St. Jude Arf6 expression across tumors expression vs gene effect scatterplot

Workflow

Load libraries

library(tidyverse)
theme_set(theme_bw())
theme_update(plot.title = element_text(hjust = 0.5))
options(bitmaptype = "cairo")
library(gt)
library(edgeR)

Helper functions

Function created through modification of code here: https://nmikolajewicz.github.io/scMiko/articles/Reports/M18_ClusterOptimization_cao2019_10000cells_240122.html
plot_tabs <- function(plot_list_name, level = "###") {
  knitr_chunk_list <-
    lapply(seq_along(get(plot_list_name)), function(i) {
      tab_name <- names(get(plot_list_name))[i]

      a1 <- knitr::knit_expand(text = sprintf(paste0("\n",
                                              level,
                                              " %s\n"),
                                              tab_name))

      a2 <- knitr::knit_expand(text = sprintf("\n```{r %s, message=FALSE, warning=FALSE}",
                                              paste0(tab_name,
                                                    "_",
                                                    i)))

      a3 <- knitr::knit_expand(text = sprintf(paste0("\n",
                                                    plot_list_name,
                                                    "[['%s']]"),
                                                    tab_name))

      a4 <- knitr::knit_expand(text = "\n```\n\n")

      paste(a1, a2, a3, a4, collapse = "\n")
    })

  paste(knitr::knit(text = paste(knitr_chunk_list, collapse = "\n")))
}

Make up directory structure

for directoryName in \
  misc \
  slurmOut \
  sbatchCmds \
  input \
  output/fastqc \
  output/figures \
  output/kallisto \
  output/fusionAligned \
  output/aligned \
  output/aligned/mmSkMu/ \
  output/aligned/mmVGLL/ \
  output/aligned/Hs \
  output/aligned/drKras \
  output/aligned/drVGLL \
  output/de \
  output/geneCounts \
  output/trimmed
do
  if [ ! -d ${directoryName} ]
  then
    mkdir -p ${directoryName}
  fi 
done

Upstream prep on mouse tumor data

Run fastqc to assess data quality

#!/bin/sh
#SBATCH --account=gdkendalllab
#SBATCH --error=slurmOut/fastqc-%j.txt
#SBATCH --output=slurmOut/fastqc-%j.txt
#SBATCH --mem=2G
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=5
#SBATCH --job-name fastqc
#SBATCH --wait
#SBATCH --array=0-15
#SBATCH --time=1-00:00:00
#SBATCH --mail-user=matthew.cannon@nationwidechildrens.org
#SBATCH --mail-type=FAIL,REQUEUE,TIME_LIMIT_80

set -e ### stops bash script if line ends with error

echo ${HOSTNAME} ${SLURM_ARRAY_TASK_ID}

module purge

module load FastQC/0.11.9-Java-11.0.2

inputPath=/home/gdkendalllab/lab/raw_data/fastq/2015_12_22

fileArray=(${inputPath}/*fastq.gz)

fastqc \
  -o output/fastqc \
  -t 5 \
  --extract \
  ${fileArray[${SLURM_ARRAY_TASK_ID}]}
#!/usr/bin/perl
use warnings;
use strict;
use Getopt::Long;
use Pod::Usage;


##############################
# By Matt Cannon
# Date: 6-22-21
# Last modified: 6-22-21
# Title: splitFastqcData.pl
# Purpose: Split fastqc_data.txt into sub-files based on modules
##############################

##############################
# Options
##############################


my $verbose;
my $help;
my $input;
my $outputDir = "";

# i = integer, s = string
GetOptions ("verbose"           => \$verbose,
            "help"              => \$help,
            "input=s"           => \$input,
            "outputDir=s"       => \$outputDir
      ) or pod2usage(0) && exit;

pod2usage(1) && exit if ($help);


##############################
# Global variables
##############################
my $outputFH;
my $baseColumn;


##############################
# Code
##############################

if ($outputDir eq "") {
    $input =~ /^(.+\/)/;
    $outputDir = $1;
    mkdir $outputDir . "/split_data"
}

##############################
### Stuff
### More stuff

open my $inputFH, "$input" or die "Could not open first input\n$!";

while (my $input = <$inputFH>){
    chomp $input;
    if ($input =~ /^>>END_MODULE/) {
        # Close file
        close $outputFH;
        undef $baseColumn;
    } elsif ($input =~ /^>>/) {
        # Open new file
        $input =~ s/^>>//;
        $input =~ s/ /_/g;
        $input =~ s/\t.+//;

        open $outputFH, ">", $outputDir . "/split_data/" . $input . ".txt";
    } elsif ($input !~ /^##/) {
        # Deal with data
        ## Find column that has base numbering
        if ($input =~ /Base|Position/ && $input =~ /^#/) {
            # Change the word Position to Base to make downstream plotting easier
            $input =~ s/Position/Base/;
            my @dataArray = split "\t", $input;
            for (my $i = 0; $i < scalar(@dataArray); $i++) {
                if ($dataArray[$i] =~ /^#*Base$/) {
                    $baseColumn = $i;
                }
            }
        }

        # write to file
        if (defined($baseColumn)) {
            my @dataArray = split "\t", $input;
            if ($dataArray[$baseColumn] =~ /-/) {
                my ($lowerNum, $upperNum) = split "-", $dataArray[$baseColumn];
                for (my $i = $lowerNum; $i <= $upperNum; $i++) {
                    my @tempArray = @dataArray;
                    $tempArray[$baseColumn] = $i;
                    print $outputFH join("\t", @tempArray), "\n";
                }
            } else {
                print $outputFH $input, "\n";
            }
        } else {
            print $outputFH $input, "\n";
        }
    }
}


##############################
# POD
##############################

#=pod
    
=head SYNOPSIS

Summary:    
    
    xxxxxx.pl - generates a consensus for a specified gene in a specified taxa
    
Usage:

    perl xxxxxx.pl [options] 


=head OPTIONS

Options:

    --verbose
    --help

=cut
sbatch sbatchCmds/fastqc.sh

rm output/fastqc/*.zip

for inFile in output/fastqc/*/fastqc_data.txt
do
  perl ~/scripts/splitFastqcData.pl -i ${inFile}
done

Plot Fastqc output for all samples

adapter_file_dirs <- list.dirs(
    path = "output/fastqc",
    full.names = TRUE,
    recursive = FALSE
)

to_plot <- tibble(
    file = c(
        "Adapter_Content.txt",
        "Per_base_sequence_quality.txt"
    ),
    column = c(
        "Nextera_Transposase_Sequence",
        "Mean"
    )
)

for (i in 1:nrow(to_plot)) {
    temp_data <- NULL

    for (data_dir in adapter_file_dirs) {
        temp_data <- read_delim(paste(data_dir,
            "/split_data/",
            to_plot$file[i],
            sep = ""),
        delim = "\t",
        col_names = TRUE,
        show_col_types = FALSE) %>%
            rename_all(~ str_replace_all(., " ", "_")) %>%
            rename_all(~ str_remove(., "[#']")) %>%
            select(Base, to_plot$column[i]) %>%
            mutate(Sample = str_remove(data_dir, "_001_fastqc") %>%
                str_remove(".+\\/")) %>%
            mutate(Read = str_remove(Sample, ".+_")) %>%
            mutate(Tech = str_remove(Sample, "_.+")) %>%
            bind_rows(temp_data)
    }

    max_y <- ceiling(max(temp_data[[to_plot$column[i]]]) * 1.1)

    plot_fastqc <- ggplot(temp_data,
                          aes(x = Base,
                              y = get(to_plot$column[i]),
                              group = Sample,
                              color = Sample)) +
        geom_line() +
        geom_point() +
        facet_wrap(~Tech, ncol = 1) +
        ylim(0, max_y) +
        ylab(to_plot$column[i]) +
        theme(legend.position = "none") +
        ggtitle(str_remove(to_plot$file[i], ".txt"))

    png(
        filename = paste("output/figures/fastqc_",
            str_remove(to_plot$file[i], ".txt"),
            ".png",
            sep = ""),
        width = 7000,
        height = 3000,
        res = 300)
    print(plot_fastqc)
    dev.off()
}

input_reads <- read_tsv(list.files(path = "output/fastqc",
                                   recursive = TRUE,
                                   pattern = "Basic_Statistics.txt",
                                   full.names = TRUE),
                        id = "File",
                        show_col_types = FALSE) %>%
    filter(`#Measure` == "Total Sequences") %>%
    select(-`#Measure`) %>%
    mutate(
        File = str_remove(File, "output/fastqc/") %>%
            str_remove("_001_fastqc/split_data/Basic_Statistics.txt"),
        Sample = str_remove(File, "_S[0-9].+"),
        Lane = str_remove(File, ".+_L00") %>%
            str_remove("_R[12]"),
        Read = str_replace(File, ".+_R", "R"),
        Value = as.numeric(Value)) %>%
    rename(Reads = Value)

ggplot(input_reads, aes(x = Sample, y = Reads / 1000000, fill = Lane)) +
    geom_col() +
    facet_wrap(~Read, ncol = 1) +
    ylab("Reads (in millions)") +
    xlab("") +
    ggtitle("Sequenced Reads Per Sample") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggsave("output/figures/InputReads.png",
    height = 3000,
    width = 4000,
    units = "px"
)

Upstream prep on control mouse skeletal muscle data from SRA

Got human skeletal muscle data from GTEx

https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/gene_reads/gene_reads_2017-06-05_v8_muscle_skeletal.gct.gz

Edited the file by hand to get rid of extraneous header info

Aligning all data

Make Mm10 reference compatible with STAR

sbatch sbatchCmd/genStarRef.sh

Make Hg38.4 reference compatible with STAR

{bash starRef, eval=FALSE sbatch ref/starhg38.p4/genStarRef.sh

Make VGLL2NCOA2 fusion reference compatible with STAR

sbatch ref/fusionRef/genStarFusionRef.sh

Align mouse tumor data to the mm10 genome

#!/bin/sh
#SBATCH --account=gdkendalllab
#SBATCH --array=0-7
#SBATCH --error=slurmOut/alignMmVGLL-%j.txt
#SBATCH --output=slurmOut/alignMmVGLL-%j.txt
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --job-name alignMmVGLL
#SBATCH --wait
#SBATCH --mail-user=matthew.cannon@nationwidechildrens.org
#SBATCH --mail-type=FAIL,REQUEUE,TIME_LIMIT_80
#SBATCH --time=2-00:00:00

set -e ### stops bash script if line ends with error

echo ${HOSTNAME} ${SLURM_ARRAY_TASK_ID}

ml purge

ml load GCCcore/8.3.0 \
        Trim_Galore/0.6.5-Java-11.0.2-Python-3.7.4

inputPath=/home/gdkendalllab/lab/raw_data/fastq/2015_12_22
tempPath=/gpfs0/scratch/mvc002/kendall

fileArray=(${inputPath}/*R1.fastq.gz)

R1=${fileArray[${SLURM_ARRAY_TASK_ID}]}
R2=${R1/R1.fastq/R2.fastq}

baseName=${R1%.R1.fastq.gz}
baseName=${baseName##*/}

trim_galore \
    --length 30 \
    -o ${tempPath} \
    -j 10 \
    --paired \
    ${R1} \
    ${R2}

# Hard trim to 50bp to keep consistent across samples
trimLen=150
perl ~/oldCode/oldScripts/trimFastq.pl \
    --max ${trimLen} \
    --fastq ${tempPath}/${baseName}.R1_val_1.fq.gz |
    pigz \
        > ${tempPath}/${baseName}_1.fastq.gz

perl ~/oldCode/oldScripts/trimFastq.pl \
    --max ${trimLen} \
    --fastq ${tempPath}/${baseName}.R2_val_2.fq.gz |
    pigz \
        > ${tempPath}/${baseName}_2.fastq.gz

ml purge
ml load GCC/7.3.0-2.30 \
        OpenMPI/3.1.1 \
        SAMtools/1.9

STAR \
    --runMode alignReads \
    --outSAMtype BAM SortedByCoordinate \
    --runThreadN 10 \
    --outFilterMultimapNmax 1 \
    --readFilesCommand zcat \
    --genomeDir ref/starmm10 \
    --readFilesIn ${tempPath}/${baseName}_1.fastq.gz \
                  ${tempPath}/${baseName}_2.fastq.gz \
    --outFileNamePrefix output/aligned/mmVGLL/${baseName##*/}

samtools index output/aligned/mmVGLL/${baseName##*/}Aligned.sortedByCoord.out.bam

rm ${tempPath}/${baseName}.R[12]_val_[12].fq.gz \
   ${tempPath}/${baseName}.R[12].fastq.gz_trimming_report.txt \
   ${tempPath}/${baseName}_[12].fastq.gz
STAR --version

sbatch sbatchCmds/alignMmVGLL.sh

Align mouse skeletal muscle data from SRA

Download from SRA Trim adapters Trim reads to 101bp to match mouse tumor samples Align to Mm genome
sbatch sbatchCmds/alignMmSkMuscle.sh

Align zebrafish P53/Kras mutant data to the zebrafish genome

sbatch sbatchCmds/alignDrKras.sh

Align zebrafhish VGLL2NCOA2 samples to the zebrafish genome

sbatch sbatchCmds/alignDrVGLL.sh

Align human data to the hg38.4 genome

sbatch sbatch/alignHs.sh

Align human, mouse and zebrafish VGLL2NCOA2 tumors to the VGLL2NCOA2 fusion sequence

sbatch sbatchCmds/alignFusion.sh

Count the number of reads aligned per gene

Count mouse genes

#!/bin/sh
#SBATCH --account=gdkendalllab
#SBATCH --error=slurmOut/countGenesMm-%j.txt
#SBATCH --output=slurmOut/countGenesMm-%j.txt
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --job-name countGenesMm
#SBATCH --wait
#SBATCH --array=0-27
#SBATCH --mail-user=matthew.cannon@nationwidechildrens.org
#SBATCH --mail-type=FAIL,REQUEUE,TIME_LIMIT_80
#SBATCH --time=1-24:00:00

set -e ### stops bash script if line ends with error

echo ${HOSTNAME} ${SLURM_ARRAY_TASK_ID}

fileArray=(output/aligned/mmVGLL/*.bam
           output/aligned/mmSkMu/*Aligned.sortedByCoord.out.bam)

inFile=${fileArray[${SLURM_ARRAY_TASK_ID}]}
baseName=${inFile##*/}
baseName=${baseName%Aligned.sortedByCoord.out.bam}

ml purge
ml load HTSeq/0.12.4

htseq-count \
    -n 4 \
    -t exon \
    -c output/geneCounts/geneCountsMm_${baseName}.txt \
    -r pos \
    -s no \
    ${inFile} \
    /reference/mus_musculus/mm10/ucsc_assmebly/illumina_download/Annotation/Genes/genes.gtf
sbatch sbatchCmds/countMmGenes.sh

Count human data

sbatch sbatchCmds/countHsGenes.sh

Count zebrafish data

sbatch sbatchCmds/countDrGenes.sh

Plot percent of reads aligned

Get data from Log.final.out – need to re-do this section
grep_counts <- function(query, name) {
    system(paste0("grep '", query, "' output/aligned/*/*Log.final.out"),
           intern = TRUE) %>%
    as_tibble() %>%
    separate(col = value,
             into = c("sample", name),
             sep = "\t") %>%
    mutate(sample = str_remove(sample, "output/aligned/") %>%
                str_remove("Log.final.out.+"))
}

aligned_counts <-
    grep_counts("Number of input reads", "input_reads") %>%
    full_join(grep_counts("Uniquely mapped reads number", "mapped_reads"),
              by = "sample") %>%
    full_join(grep_counts("Uniquely mapped reads %", "perc_mapped"),
              by = "sample") %>%
    separate(col = sample,
             into = c("project", "sample"),
             sep = "/")

for species_abr in c("Mm", "Hs", "Dr") {
    tibble(assigned_reads =
           read_tsv(paste0("output/geneCounts/geneCounts", species_abr, ".txt"),
                    col_names = FALSE,
                    show_col_types = FALSE) %>%
        select(-X1) %>%
        colSums(),
    sample = system(""))
}

    read_tsv(list.files(path = "output/aligned/counts",
                        pattern = "_counts.txt",
                        full.names = TRUE),
             id = "Sample",
             col_names = "aligned_reads",
             show_col_types = FALSE) %>%
    mutate(Sample = str_remove(Sample, "output/aligned/counts/") %>%
            str_remove("_counts.txt"))

ggplot(aligned_counts, aes(x = Sample, y = aligned_reads / 1000000)) +
    geom_col() +
    labs(x = "",
         y = "Reads Aligned (in millions)",
         title = "Number of Reads Aligned to the Genome") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggsave("output/figures/CountAligned.png",
       height = 3000,
       width = 4000,
       units = "px")

Read in gene counts per sample

Remember to kick out the last few rows which have summary numbers Columns are arranged in alphabetic order by sample, double checked this by calculating colsums on each and comparing to output/aligned/*.final.out

Ortholog sources:

http://www.informatics.jax.org/downloads/reports/HOM_AllOrganism.rpt https://www.genenames.org/tools/hcop/ http://www.ensembl.org/biomart/martview/ https://www.alliancegenome.org/downloads

Merge read counts into a single table that can be used later

for (species in c("Dr", "Mm", "Hs")) {
    gene_counts <- read_tsv(list.files(path = "output/geneCounts",
                                       pattern = paste0("geneCounts",
                                                        species,
                                                        ".+.txt$"),
                                       full.names = TRUE),
                            id = "sample",
                            col_names = c("gene", "count"),
                            show_col_types = FALSE) %>%
        mutate(sample = str_remove(sample,
                                   paste0("output/geneCounts/geneCounts",
                                          species,
                                          "_")) %>%
                            str_remove(".txt")) %>%
        filter(grepl("^__", gene) == FALSE) %>%
        pivot_wider(names_from = "sample",
                    values_from = "count",
                    values_fill = 0)

    # keep only genes in the counts files for human
    if (species == "Hs") {
        gene_counts <-
            inner_join(gene_counts,
                      read_tsv("input/gene_reads_2017-06-05_v8_muscle_skeletal.gct.gz",
                               show_col_types = FALSE) %>%
                select(-id, -Name) %>%
                group_by(Description) %>%
                summarize(across(where(is.numeric), sum)) %>%
                rename(gene = Description)) %>%
            mutate(across(everything(), ~replace_na(.x, 0)))
    }

    # Keep only genes that are in the tumor samples since some of the samples
    # were prepped with a different library prep method
    if (species == "Mm") {
        sk_mu <-
            gene_counts %>%
            select(c(gene, starts_with("SRR"))) %>%
            rowwise() %>%
            mutate(row_sum = sum(c_across(where(is.numeric)))) %>%
            ungroup() %>%
            filter(row_sum > 0) %>%
            select(-row_sum)

        tumor <-
            gene_counts %>%
            select(c(gene, !starts_with("SRR"))) %>%
            rowwise() %>%
            mutate(row_sum = sum(c_across(where(is.numeric)))) %>%
            ungroup() %>%
            filter(row_sum > 0) %>%
            select(-row_sum)

        gene_counts <- inner_join(tumor, sk_mu, by = "gene")
    }

    write_tsv(gene_counts, paste0("output/geneCounts/geneCountsCombined",
                                  species,
                                  ".txt.gz"))
}

PCA on multiple datasets

ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE109nnn/GSE109581/suppl/GSE109581%5FcountsMatrix%2Etxt%2Egz

PCA on zebrafish samples

VGLL2NCOA2 tumors KRAS GSE39731; GSE32425), CICDUX4 (us), MPNST (GSE11493), skeletal muscle (us); and MPNST/ERMS/leukemia/others from fish (GSE109581)

PCA

metadata <- read_tsv("misc/sampleMetadata.txt",
                       show_col_types = FALSE)

for (species in c("Dr", "Mm", "Hs")) {

    gene_counts <- read_tsv(paste0("output/geneCounts/geneCountsCombined",
                                   species,
                                   ".txt.gz"),
                            show_col_types = FALSE) %>%
    column_to_rownames("gene")

    # Add GTEX data to the metadata
    if (species == "Hs") {
        metadata <- metadata %>%
            filter(!grepl("GTEX", sample)) %>%
            bind_rows(tibble(sample = colnames(gene_counts)[grep("GTEX",
                                                                 colnames(gene_counts))],
                             group = "HsSkMu",
                             bioproject = "GTEX"))
    }

    groups <- metadata[metadata %>%
                       pull(sample) %>%
                       match(colnames(gene_counts), .), ] %>%
        pull(group)

    data_dge <- DGEList(counts = gene_counts, group = groups)

    kept_df <- filterByExpr(data_dge, min.count = 0.1)

    data_dge <- data_dge[kept_df, keep.lib.sizes = FALSE] %>%
        calcNormFactors(.) %>%
        estimateDisp(., model.matrix(~ groups))

    dim(data_dge$counts)

    norm_counts <-
        cpm(data_dge) %>%
        as.data.frame() %>%
        t()

    pca_out <- prcomp(norm_counts)

    pca_out_merged <-
        left_join(as.data.frame(pca_out$x) %>%
                    rownames_to_column("sample"),
                  metadata)

    variance <- (pca_out$sdev)^2 / sum(pca_out$sdev^2) * 100

    for (color_group in c("group", "bioproject")) {
        ggplot(pca_out_merged,
            aes(x = PC1,
                y = PC2,
                color = get(color_group),
                label = sample)) +
            geom_point(size = 2.5, color = "black") +
            geom_point(size = 2) +
            ggrepel::geom_text_repel(size = 2, max.overlaps = 20) +
            scale_color_brewer(palette = "Set2",
                               name = stringr::str_to_title(color_group)) +
            xlab(paste0("PC1 (",
                        round(variance[1], 1),
                        "%)")) +
            ylab(paste0("PC2 (",
                        round(variance[2], 1),
                        "%)")) +
            ggtitle(paste("PCA on", species))

        ggsave(paste0("output/figures/PCA_",
                      species,
                      "_",
                      color_group,
                      ".png"),
            height = 10,
            width = 13)

        pca_out_merged %>%
            select({{ color_group }}, paste0("PC", 1:9)) %>%
            pivot_longer(cols = -{{ color_group }},
                         names_to = "pc",
                         values_to = "pc_value") %>%
            group_by(pc) %>%
            mutate(scaled_pc_value = scale(pc_value)) %>%
            ggplot(aes(x = pc,
                       y = scaled_pc_value,
                       color = get(color_group))) +
            geom_boxplot(outlier.shape = NA) +
            geom_point(position = position_jitterdodge(jitter.width = 0.05),
                       size = 0.5) +
            scale_color_brewer(palette = "Set2",
                               name = stringr::str_to_title(color_group)) +
            ggtitle("Separation of Groups by PCs")

        ggsave(paste0("output/figures/PCA_",
                      species,
                      "_",
                      color_group,
                      "_boxplot.png"),
            height = 8,
            width = 13)
    }
}
## Joining, by = "sample"
## Warning: ggrepel: 22 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: Removed 43 rows containing missing values (geom_point).
## Warning: ggrepel: 22 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: Removed 387 rows containing missing values (geom_point).
## Joining, by = "sample"
## Warning: Removed 8 rows containing missing values (geom_point).
## Warning: Removed 72 rows containing missing values (geom_point).
## Joining, by = "sample"
## Warning: ggrepel: 556 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: Removed 5 rows containing missing values (geom_point).
## Warning: ggrepel: 554 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: Removed 45 rows containing missing values (geom_point).

Do PCA/UMAP on just zebrafish tumors

metadata <- read_tsv("misc/sampleMetadata.txt",
                       show_col_types = FALSE)

gene_counts <- read_tsv(paste0("output/geneCounts/geneCountsCombinedDr.txt.gz"),
                        show_col_types = FALSE) %>%
    column_to_rownames("gene") %>%
    select(-matches("Muscle"),
            -`A374-A375T23`,
            -`A374-A375T24`)

groups <- metadata[metadata %>%
                    pull(sample) %>%
                    match(colnames(gene_counts), .), ] %>%
    pull(group)

data_dge <- DGEList(counts = gene_counts, group = groups)

kept_df <- filterByExpr(data_dge, min.count = 0.1)

data_dge <- data_dge[kept_df, keep.lib.sizes = FALSE] %>%
    calcNormFactors(.) %>%
    estimateDisp(., model.matrix(~ groups))

dim(data_dge$counts)
## [1] 26971    36
norm_counts <-
    cpm(data_dge) %>%
    as.data.frame() %>%
    t()

### Heatmaps
png("output/figures/heatmap_zebrafish_tumors.png",
    width = 2000,
    height = 2000,
    res = 300)
pheatmap::pheatmap(t(norm_counts),
                   color = colorRampPalette(rev(RColorBrewer::brewer.pal(n = 7,
                                                                         name = "PuOr")))(100),
                   show_rownames = FALSE,
                   scale = "row",
                   fontsize = 8,
                   annotation_names_col = FALSE,
                   annotation_col = data.frame(Group = groups,
                                               row.names = colnames(gene_counts)))
dev.off()
## png 
##   3
### PCA
pca_out <- prcomp(norm_counts)

pca_out_merged <-
    left_join(as.data.frame(pca_out$x) %>%
                rownames_to_column("sample"),
                metadata)
## Joining, by = "sample"
variance <- (pca_out$sdev)^2 / sum(pca_out$sdev^2) * 100

ggplot(pca_out_merged,
       aes(x = PC1,
           y = PC2,
           color = group,
           label = sample)) +
    geom_point(size = 2.5, color = "black") +
    geom_point(size = 2) +
    ggrepel::geom_text_repel(size = 2, max.overlaps = 20) +
    scale_color_brewer(palette = "Set2",
                        name = "Group") +
    xlab(paste0("PC1 (",
                round(variance[1], 1),
                "%)")) +
    ylab(paste0("PC2 (",
                round(variance[2], 1),
                "%)")) +
    ggtitle("PCA on Zebrafish Tumors")
## Warning: ggrepel: 18 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
ggsave(paste0("output/figures/PCA_Dr_tumors.png"),
    height = 10,
    width = 13)
## Warning: ggrepel: 13 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
pca_out_merged %>%
    select(group, paste0("PC", 1:9)) %>%
    pivot_longer(cols = -group,
                    names_to = "pc",
                    values_to = "pc_value") %>%
    group_by(pc) %>%
    mutate(scaled_pc_value = scale(pc_value)) %>%
    ggplot(aes(x = pc,
                y = scaled_pc_value,
                color = group)) +
    geom_boxplot(outlier.shape = NA) +
    geom_point(position = position_jitterdodge(jitter.width = 0.05),
                size = 0.5) +
    scale_color_brewer(palette = "Set2",
                        name = stringr::str_to_title(color_group)) +
    ggtitle("Separation of Groups by PCs")

ggsave(paste0("output/figures/PCA_Dr_boxplot.png"),
    height = 8,
    width = 13)

set.seed(1337)
umap_out <- umap::umap(norm_counts, n_neighbors = 4)$layout %>%
    as.data.frame() %>%
    rownames_to_column("sample") %>%
    as_tibble() %>%
    left_join(metadata)
## Joining, by = "sample"
centroids <- umap_out %>%
    group_by(group) %>%
    summarize(V1 = mean(V1),
              V2 = mean(V2),
              .groups = "drop")

ggplot(umap_out,
        aes(x = V1,
            y = V2,
            color = group)) +
    geom_point(size = 4) +
    ggrepel::geom_text_repel(data = centroids,
                             aes(x = V1,
                                 y = V2,
                                 label = group),
                             color = "black",
                             force = 50) +
    scale_color_brewer(palette = "Paired",
                        name = "Group") +
    ggrepel::geom_text_repel(aes(label = sample))

ggsave(paste0("output/figures/umap_Dr_tumors_groups.png"),
    height = 8,
    width = 15)

Merged species PCA

merged_data <- tibble(Dr = character())

species_full <- list(Hs = "Homo sapiens",
                     Dr = "Danio rerio",
                     Mm = "Mus musculus")

for (species in c("Dr", "Mm", "Hs")) {

    gene_counts <- read_tsv(paste0("output/geneCounts/geneCountsCombined",
                                   species,
                                   ".txt.gz"),
                            show_col_types = FALSE) %>%
        rename({{ species }} := gene)

    if (species != "Dr") {
        gene_counts <- right_join(read_tsv("../refs/ORTHOLOGY-ALLIANCE_COMBINED.tsv.gz",
                                           show_col_types = FALSE,
                                           comment = "#") %>%
            dplyr::filter(Gene1SpeciesName == species_full[["Dr"]] &
                          Gene2SpeciesName == species_full[[species]]) %>%
            select(Gene1Symbol, Gene2Symbol) %>%
            rename(Dr = Gene1Symbol,
                   {{ species }} := Gene2Symbol) %>%
            distinct(),
                                  gene_counts)
    } else {
        gene_counts <- right_join(read_tsv("/home/gdkendalllab/lab/references/annotations/zebrafish_gene_info.txt.gz",
                                           show_col_types = FALSE,
                                           comment = "#") %>%
            select(Gene_name, Gene_stable_ID_version) %>%
            rename(Dr = Gene_stable_ID_version) %>%
            distinct(),
                                  gene_counts) %>%
            select(-Dr) %>%
            rename(Dr = Gene_name) %>%
            filter(!is.na(Dr)) %>%
            group_by(Dr) %>%
            summarize(across(where(is.numeric), sum), .groups = "drop")
    }

    merged_data <- full_join(merged_data, gene_counts) %>%
        na.omit()
}

merged_data <-
    merged_data %>%
    mutate(gene = paste(Dr, Mm, Hs, sep = "/"), .before = 1) %>%
    select(-Dr, -Mm, -Hs) %>%
    column_to_rownames("gene")

metadata <-
    read_tsv("misc/sampleMetadata.txt",
                       show_col_types = FALSE) %>%
    select(sample, group) %>%
    filter(!grepl("GTEX", sample)) %>%
    bind_rows(tibble(sample = colnames(gene_counts)[grep("GTEX",
                                                            colnames(gene_counts))],
                        group = "HsSkMu",
                        bioproject = "GTEX")) %>%
    arrange(match(sample, colnames(merged_data))) %>%
    mutate(species = substr(group, 1, 2))

groups <- metadata$group

data_dge <- DGEList(counts = merged_data, group = groups)

kept_df <- filterByExpr(data_dge, min.count = 0.1)

data_dge <- data_dge[kept_df, keep.lib.sizes = FALSE] %>%
    calcNormFactors(.) %>%
    estimateDisp(., model.matrix(~ groups))

dim(data_dge$counts)

norm_counts <-
    cpm(data_dge) %>%
    as.data.frame() %>%
    t()

### Heatmaps
set.seed(1337)
non_gtex_samples <- grep("GTEX", rownames(norm_counts), invert = TRUE)
gtex_samples_20 <- grep("GTEX", rownames(norm_counts)) %>%
    sample(size = 20)
sub_gtex <- norm_counts[c(non_gtex_samples, gtex_samples_20),]

annot_df <-  data.frame(Group = groups[c(non_gtex_samples, gtex_samples_20)],
                        row.names = colnames(merged_data)[c(non_gtex_samples, gtex_samples_20)]) %>%
    mutate(Species = substr(Group, 1, 2))

png("output/figures/heatmap_all.png",
    width = 3500,
    height = 2000,
    res = 300)
pheatmap::pheatmap(t(sub_gtex),
                   color = colorRampPalette(rev(RColorBrewer::brewer.pal(n = 7,
                                                                         name = "PuOr")))(100),
                   show_rownames = FALSE,
                   scale = "row",
                   fontsize = 8,
                   annotation_names_col = FALSE,
                   annotation_col = annot_df)
dev.off()

### PCA
pca_out <- prcomp(norm_counts)

pca_out_merged <-
    left_join(as.data.frame(pca_out$x) %>%
                rownames_to_column("sample"),
                metadata)

variance <- (pca_out$sdev)^2 / sum(pca_out$sdev^2) * 100

ggplot(pca_out_merged,
    aes(x = PC1,
        y = PC2,
        color = group)) +
    geom_point(size = 2.5, color = "black") +
    geom_point(size = 2) +
    scale_color_brewer(palette = "Paired",
                        name = "Group") +
    xlab(paste0("PC1 (",
                round(variance[1], 1),
                "%)")) +
    ylab(paste0("PC2 (",
                round(variance[2], 1),
                "%)")) +
    ggtitle(paste("PCA on", species))

ggsave(paste0("output/figures/PCA_all_groups.png"),
    height = 10,
    width = 13)

pca_out_merged %>%
    select(group, paste0("PC", 1:9)) %>%
    pivot_longer(cols = -group,
                 names_to = "pc",
                 values_to = "pc_value") %>%
    group_by(pc) %>%
    mutate(scaled_pc_value = scale(pc_value)) %>%
    ggplot(aes(x = pc,
               y = scaled_pc_value,
               color = group)) +
    geom_boxplot(outlier.shape = NA) +
    geom_point(position = position_jitterdodge(jitter.width = 0.05),
               size = 0.2) +
    scale_color_brewer(palette = "Paired",
                       name = "Group") +
    ggtitle("Separation of Groups by PCs")

ggsave(paste0("output/figures/PCA_all_boxplot.png"),
    height = 8,
    width = 13)

set.seed(1337)
umap_out <- umap::umap(norm_counts)$layout %>%
    as.data.frame() %>%
    rownames_to_column("sample") %>%
    as_tibble() %>%
    full_join(metadata)

centroids <- umap_out %>%
    group_by(group) %>%
    summarize(V1 = mean(V1),
              V2 = mean(V2),
              .groups = "drop")

ggplot(umap_out,
        aes(x = V1,
            y = V2,
            color = group)) +
    geom_point() +
    ggrepel::geom_text_repel(data = centroids,
                             aes(x = V1,
                                 y = V2,
                                 label = group),
                             color = "black",
                             force = 50) +
    scale_color_brewer(palette = "Paired",
                        name = "Group")

ggsave(paste0("output/figures/umap_all_groups.png"),
    height = 8,
    width = 15)

Run DE

Needed: DrFusion vs DrSkMu DrKras vs DrKrasCtrl MmFusion vs MmSkMu HsFusion vs HsSkMu
metadata <- read_tsv("misc/sampleMetadata.txt",
                       show_col_types = FALSE)

de_cmp_table <- tibble(group_1 = c("DrFusion", "DrKras", "MmFusion", "HsFusion"),
                        group_2 = c("DrSkMu", "DrKrasCtrl", "MmSkMu", "HsSkMu"))

logFC_plot_list <- list()
volcano_plot_list <- list()
scatterplot_list <- list()

for (i in seq_len(nrow(de_cmp_table))) {
## EdgeR portion
    species <- substr(de_cmp_table[i, "group_1"], 1, 2)
    count_data <- read_tsv(paste0("output/geneCounts/geneCountsCombined",
                                  species,
                                  ".txt.gz"),
                           show_col_types = FALSE)

    if (species == "Hs") {
        metadata <- metadata %>%
            filter(!grepl("GTEX", sample)) %>%
            bind_rows(tibble(sample = colnames(count_data)[grep("GTEX",
                                                                 colnames(count_data))],
                             group = "HsSkMu",
                             bioproject = "GTEX"))
    }

    # swap out gene stable ID for gene name and sum each gene across samples
    if (species == "Dr") {
        count_data <-
            inner_join(count_data,
                       read_tsv("/home/gdkendalllab/lab/references/annotations/zebrafish_gene_info.txt.gz",
                               show_col_types = FALSE,
                               col_select = c("Gene_stable_ID_version",
                                              "Gene_name")) %>%
                        mutate(gene = Gene_stable_ID_version) %>%
                        select(-Gene_stable_ID_version) %>%
                        distinct() %>%
                        na.omit()) %>%
            mutate(gene = Gene_name) %>%
            select(-Gene_name) %>%
            group_by(gene) %>%
            summarize(across(where(is.numeric), sum), .groups = "drop")
    }

    group_1_samples <- metadata$sample[metadata$group %in%
                                        de_cmp_table$group_1[i]]
    group_2_samples <- metadata$sample[metadata$group %in%
                                        de_cmp_table$group_2[i]]

    count_data <- count_data[, match(c("gene",
                                        group_1_samples,
                                        group_2_samples),
                                      colnames(count_data))] %>%
        column_to_rownames("gene")

    groups <- c(rep(de_cmp_table$group_1[i], length(group_1_samples)),
                rep(de_cmp_table$group_2[i], length(group_2_samples)))

    saveRDS(groups, file = paste0("output/de/",
                               de_cmp_table$group_1[i],
                               "_",
                               de_cmp_table$group_2[i],
                               "_groups.rds"))

    data_dge <- DGEList(counts = count_data, group = groups)

    kept_df <- filterByExpr(data_dge, min.count = 0.1)

    data_dge <- data_dge[kept_df, keep.lib.sizes = FALSE] %>%
        calcNormFactors(.) %>%
        estimateDisp(., model.matrix(~ groups))

    dim(data_dge$counts)

    tested <- exactTest(data_dge)

    results <- topTags(tested, n = Inf)
    results$table$Gene <- rownames(results$table)

    summarized_data <- cpm(data_dge) %>%
        as.data.frame() %>%
        rownames_to_column(var = "Gene") %>%
        as_tibble() %>%
        full_join(cpmByGroup(data_dge) %>%
            as.data.frame() %>%
            rename_all(~ str_c(.x, "_mean")) %>%
            rownames_to_column("Gene") %>%
            as_tibble(),
            by = "Gene") %>%
        full_join(results$table, by = "Gene") %>%
        rowwise() %>%
        mutate(signif = FDR <= 0.1)

    # add in zebrafish gene annotations
    if (species == "Dr") {
        summarized_data <-
            inner_join(summarized_data,
                       read_tsv("/home/gdkendalllab/lab/references/annotations/zebrafish_gene_info.txt.gz",
                                show_col_types = FALSE,
                                col_select = c("Gene_stable_ID_version",
                                               "Gene_name")) %>%
                       mutate(Gene = Gene_name) %>%
                       distinct()) %>%
                mutate(Gene = Gene_name) %>%
                select(-Gene_name, -Gene_stable_ID_version) %>%
                group_by(Gene) %>%
                distinct()
    }

    write_tsv(summarized_data, paste0("output/de/",
                                      de_cmp_table$group_1[i],
                                      "_",
                                      de_cmp_table$group_2[i],
                                      "_edgeR.txt"))

    ## Plots
    ### LogFC
    logFC_plot_list[[i]] <-
        ggplot(results$table, aes(x = logFC)) +
        geom_histogram(bins = 300) +
        ggtitle(paste0(de_cmp_table$group_1[i],
                       " vs ",
                       de_cmp_table$group_2[i],
                       " logFC Histogram"))

    names(logFC_plot_list)[i] <- paste(de_cmp_table$group_1[i],
                                       "vs",
                                       de_cmp_table$group_2[i])

    ggsave(paste0("output/figures/logFC_histogram_",
                    de_cmp_table$group_1[i],
                    "_",
                    de_cmp_table$group_2[i],
                    ".png"),
           logFC_plot_list[[i]],
           width = 6,
           height = 6)

    x_mean_colname <- paste0(de_cmp_table$group_1[i], "_mean")
    y_mean_colname <- paste0(de_cmp_table$group_2[i], "_mean")

    x_max <- (max(summarized_data[[x_mean_colname]]) / 1000)  %>%
                ceiling() * 1000
    y_max <- (max(summarized_data[[y_mean_colname]]) / 1000)  %>%
                ceiling() * 1000

    ### Scatterplot
    # Using "local" here because the variable x_mean_colname in the plot list
    # uses only the final value outside of the loop
    scatterplot_list[[i]] <-
        local({
            x_mean_colname <- x_mean_colname
            y_mean_colname <- y_mean_colname
            x_max <- x_max
            y_max <- y_max

            ggplot(summarized_data,
            aes(x = get(x_mean_colname),
                y = get(y_mean_colname))) +
            geom_line(data = tibble({{ x_mean_colname }} := c(0, x_max),
                                    {{ y_mean_colname }} := c(0, y_max)),
                      color = "red") +
            geom_point(alpha = 0.5, size = 0.5) +
            scale_x_log10() +
            scale_y_log10() +
            labs(x = paste0(de_cmp_table$group_1[i], " mean"),
                 y = paste0(de_cmp_table$group_2[i], " mean"),
                 title = paste0(de_cmp_table$group_1[i],
                                " vs ",
                                de_cmp_table$group_2[i],
                                " Scatterplot"))
        })

    names(scatterplot_list)[i] <- paste(de_cmp_table$group_1[i],
                                        "vs",
                                        de_cmp_table$group_2[i])

    ggsave(paste0("output/figures/scatterplot_",
                    de_cmp_table$group_1[i],
                    "_",
                    de_cmp_table$group_2[i],
                    ".png"),
           scatterplot_list[[i]],
           width = 6,
           height = 6)

    ### Volcano
    volcano_plot_list[[i]] <-
        ggplot(summarized_data, aes(x = logFC,
                                    y = -log10(PValue),
                                    color = signif)) +
        geom_point(alpha = 0.1) +
        scale_color_brewer(palette = "Set2", name = "Signif.") +
        labs(y = "-log10(p-value)",
             title = paste0(de_cmp_table$group_1[i],
                            " vs ",
                            de_cmp_table$group_2[i],
                            " Volcano Plot"))

    names(volcano_plot_list)[[i]] <- paste(de_cmp_table$group_1[i],
                                         "vs",
                                         de_cmp_table$group_2[i])

    ggsave(paste0("output/figures/volcano_plot_",
                    de_cmp_table$group_1[i],
                    "_",
                    de_cmp_table$group_2[i],
                    ".png"),
           volcano_plot_list[[i]],
           width = 10,
           height = 6)
}
## Joining, by = "gene"
## Joining, by = "Gene"
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Transformation introduced infinite values in continuous y-axis
## Joining, by = "gene"
## Joining, by = "Gene"
## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous y-axis

logFC Plots

DrFusion vs DrSkMu

logFC_plot_list[['DrFusion vs DrSkMu']] 

DrKras vs DrKrasCtrl

logFC_plot_list[['DrKras vs DrKrasCtrl']] 

MmFusion vs MmSkMu

logFC_plot_list[['MmFusion vs MmSkMu']] 

HsFusion vs HsSkMu

logFC_plot_list[['HsFusion vs HsSkMu']] 

Scatterplots

DrFusion vs DrSkMu

scatterplot_list[['DrFusion vs DrSkMu']] 

DrKras vs DrKrasCtrl

scatterplot_list[['DrKras vs DrKrasCtrl']] 

MmFusion vs MmSkMu

scatterplot_list[['MmFusion vs MmSkMu']] 

HsFusion vs HsSkMu

scatterplot_list[['HsFusion vs HsSkMu']] 

Volcano Plots

DrFusion vs DrSkMu

volcano_plot_list[['DrFusion vs DrSkMu']] 

DrKras vs DrKrasCtrl

volcano_plot_list[['DrKras vs DrKrasCtrl']] 

MmFusion vs MmSkMu

volcano_plot_list[['MmFusion vs MmSkMu']] 

HsFusion vs HsSkMu

volcano_plot_list[['HsFusion vs HsSkMu']] 

Compare our zebrafish data to other datasets

VGLL2-NCOA2 vs mouse

orthologs <- read_tsv("../refs/ORTHOLOGY-ALLIANCE_COMBINED.tsv.gz",
                      show_col_types = FALSE,
                      comment = "#") %>%
    filter(Gene2SpeciesName == "Danio rerio" &
           Gene1SpeciesName == "Mus musculus") %>%
    rename(mouse_gene = Gene1Symbol,
           zebrafish_gene = Gene2Symbol) %>%
    select(mouse_gene, zebrafish_gene) %>%
    distinct()

# In the zfin ortholog, arf6b has no mouse ortholog, but from other sources, it
# is a homolog of Arf6, so I'm going to manually edit that here
# Also, ensemble has Pfn2 as the mouse ortholog of pnf2l, so I'll also include that
orthologs <- bind_rows(tibble(mouse_gene = c("Arf6",
                                             "Pfn2"),
                              zebrafish_gene = c("arf6b",
                                                 "pfn2l")),
                       orthologs)

# This will duplicate rows where there are more than one human ortholog per zebrafish gene
fish_genes <- read_tsv("VGLL2NCOA2_27genelist.txt",
                       show_col_types = FALSE) %>%
    rename(zebrafish_gene = gene) %>%
    left_join(orthologs)

summary(fish_genes$mouse_gene %in% summarized_data$Gene)

combined_data <- left_join(fish_genes,
                  summarized_data %>%
                    rename(mouse_gene = Gene,
                           mouse_logFC = logFC) %>%
                    select(mouse_gene,
                           mouse_logFC,
                           signif,
                           Xenograft_C2C12_VGLL2_NCOA2_mean,
                           Xenograft_C2C12_pcDNA_mean)) %>%
    select(zebrafish_gene,
           mouse_gene,
           mouse_logFC,
           signif,
           mean_of_tumor_fpkm,
           mean_of_development,
           mean_of_muscle,
           Xenograft_C2C12_VGLL2_NCOA2_mean,
           Xenograft_C2C12_pcDNA_mean) %>%
    mutate(zebrafish_logFC = log2(mean_of_tumor_fpkm / mean_of_muscle))


combined_data %>%
    na.omit() %>%
    select(mouse_logFC,
           zebrafish_logFC,
           signif,
           zebrafish_gene,
           mouse_gene) %>%
ggplot(aes(x = mouse_logFC,
           y = zebrafish_logFC,
           color = signif)) +
    geom_point() +
    ggrepel::geom_text_repel(aes(label = paste(zebrafish_gene,
                                               mouse_gene,
                                               sep = "/"))) +
    geom_vline(xintercept = 0) +
    geom_hline(yintercept = 0) +
    scale_color_brewer(palette = "Set2",
                       name = "Significant\nin mouse") +
    labs(x = "Mouse logFC",
         y = "Zebrafish logFC",
         title = "Differential Expression in Zebrafish and Mouse Tumors")

ggsave(file = "output/figures/zebrafish_mouse_diff_expression.png",
       width = 7,
       height = 7)

png(filename = "output/figures/zebrafish_mouse_diff_expression_heatmap.png",
    width = 2000,
    height = 2500,
    res = 300)
combined_data %>%
    mutate(gene = paste(zebrafish_gene, mouse_gene)) %>%
    select(gene,
           mean_of_tumor_fpkm,
           mean_of_muscle,
           mean_of_development,
           Xenograft_C2C12_VGLL2_NCOA2_mean,
           Xenograft_C2C12_pcDNA_mean) %>%
    rename_with(~ str_replace(.x, "mean_of_", "Zebrafish ") %>%
                str_remove("_fpkm") %>%
                str_remove("_mean") %>%
                str_replace("Xeno", "Mm xeno") %>%
                str_replace("pcDNA", "ctrl") %>%
                str_replace_all("_", " ")) %>%
    distinct() %>%
    column_to_rownames("gene") %>%
    as.matrix() %>% 
    pheatmap::pheatmap(scale = "none",
                       display_numbers = TRUE,
                       number_color = "#9c9c9c",
                       legend = FALSE,
                       angle_col = 45,
                       color = colorRampPalette(rev(RColorBrewer::brewer.pal(n = 7,
                                                                         name = "YlOrRd")))(100))
dev.off()

Mouse tumors vs skeletal muscle

VGLL2-NCOA2 vs human RMS (VGLL2-NCOA2)

VGLL2-NCOA2 vs fish skeletal muscle

human RMS (VGLL2-NCOA2) vs human skeletal muscle

SkM from GTEX

Comparing DEPMAP gene effect and gene expression values for ARF6

irdb <- read_tsv("iRDB_Arf6_RNAseq.txt", comment = "#") %>%
    mutate(diagnosis_short = fct_reorder(diagnosis_short, FPKM)) %>%
    select(diagnosis_short,
           FPKM) %>%
    na.omit() %>%
    full_join(read_tsv("misc/iRDB_DEPMAP_key.txt")) %>%
    left_join(read_csv("CRISPR_DepMap_22Q1_Public_Score_Chronos_subsetted_Arf6.csv") %>%
        select(lineage_2, ARF6) %>%
        distinct())

ggplot(irdb, aes(x = diagnosis_short, y = FPKM)) +
    geom_boxplot() +
    geom_jitter(alpha = 0.5) +
    labs(x = "Diagnosis",
         y = "FPKM",
         title = "Tumor FPKM data from St. Jude iRDB")

ggplot(irdb, aes(x = ARF6, y = FPKM)) +
    geom_point()

test <- read_csv("misc/ARF6 Expression 22Q1 Public.csv") %>%
    rename(depmap_id = `Depmap ID`,
           ARF6_expr = `Expression 22Q1 Public`,
           cell_line_display_name = `Cell Line Name`) %>%
    full_join(read_csv("CRISPR_DepMap_22Q1_Public_Score_Chronos_subsetted_Arf6.csv") %>%
        select(depmap_id,
               cell_line_display_name,
               lineage_1,
               lineage_2,
               lineage_3,
               ARF6)) %>%
    rename(ARF6_gene_effect = ARF6)

ggplot(test, aes(x = ARF6_gene_effect,
                 y = ARF6_expr,
                 color = lineage_2 == "Rhabdomyosarcoma")) +
    geom_point() +
    geom_point(data = test %>%
               filter(lineage_2 == "Rhabdomyosarcoma"),
               aes(x = ARF6_gene_effect,
                   y = ARF6_expr),
                   color = "#fc8d62",
                   size = 3) +
    ggrepel::geom_text_repel(data = test %>%
                                filter(lineage_2 == "Rhabdomyosarcoma"),
                             aes(x = ARF6_gene_effect,
                                 y = ARF6_expr,
                                 label = cell_line_display_name),
                                 color = "black",
                                 size = 2) +
    labs(x = "ARF6 Gene Effect",
         y = "Expression 22Q1 Public",
         title = "Expression of ARF6 vs DepMap Effect") +
    scale_color_brewer(palette = "Set2",
                       name = "RMS")
ggsave("output/figures/depmap_arf6_expression.png",
       width = 6,
       height = 5)

Tead motifs upstream of 27 developmental genes

read_tsv("VGLL2NCOA2_27genelist.txt", show_col_types = FALSE) %>%
    select(gene) %>%
    left_join(read_tsv("/gpfs0/home1/gdkendalllab/lab/analyses/mvc002/refs/zebrafish_gene_info.txt.gz") %>%
                select(`Gene name`, `Gene stable ID`) %>%
                distinct() %>%
                rename(gene = `Gene name`,
                    gene_stable_id = `Gene stable ID`)

/gpfs0/home1/gdkendalllab/lab/analyses/mvc002/refs/zebrafish_gene_info.txt.gz 

/gpfs0/home1/gdkendalllab/lab/analyses/mvc002/refs/fasta/danRer11_gene_promoters.fasta.gz

Count reads mapping across VGLL2-NCOA2 fusion junction

get_data <- function(sample_name) {
  filter_cmd <- paste0("cat output/fusionAligned/",
                      sample_name,
                      "Aligned.out.sam | grep NCOA | awk '$4 <= 341 && $8 >= 441 {print $_}'")
  data_df <- system(filter_cmd, intern = TRUE) %>%
    as.data.frame() %>%
    separate(.,
             col = `.`,
             sep = "\t",
             into = c(NA, NA, NA, "pos_1", "mapq", NA, "chr_2", "pos_2", NA, NA, NA, NA, NA, NA, NA)) %>%
    filter(chr_2 == "=" &
             mapq > 200) %>%
    distinct()

    if (nrow(data_df) > 0) {
        plot_name <- data_df %>%
            mutate(pos_1 = as.numeric(pos_1),
                pos_2 = as.numeric(pos_2)) %>%
            arrange(pos_1, pos_2) %>%
            mutate(row_num = 1:n()) %>%
            select(pos_1, pos_2, row_num) %>%
            pivot_longer(cols = -row_num, names_to = "pos", values_to = "base") %>%
            ggplot(aes(x = as.numeric(base), y = row_num, group = row_num)) +
            geom_vline(xintercept = 391,
                       color = "red") +
            geom_point() +
            geom_line() +
            ggtitle(sample_name) +
            labs(x = "Base", y = "")

        print(plot_name)
    }

  return(data_df)
}

samples <- c("A377-A378T1",
             "A377-A378T2",
             "A377-A378T3",
             "A377-A378T4",
             "A377-A378T5",
             "A377-A378T6",
             "A377-A378T7",
             "A377-A378T8")

names(samples) <- samples

pdf("output/figures/fusionMappedPos.pdf")
fusion_mapped <- lapply(samples, function(x) get_data(x) %>% nrow())
dev.off()

fusion_mapped <- tibble(fusion_count = as.numeric(fusion_mapped),
                        sample_name = names(fusion_mapped))

get_reads_cmd <- "grep 'Uniquely mapped reads number' output/aligned/*Log.final.out"

fusion_data <- tibble(data_in = system(get_reads_cmd, intern = TRUE)) %>%
    separate(data_in,
             sep = "\t",
             into = c("sample_name", "reads_mapped")) %>%
    mutate(sample_name = sample_name %>%
                str_remove("output/aligned/") %>%
                str_remove("Log.final.out:.+"),
                reads_mapped = as.numeric(reads_mapped),
           M_reads_mapped = reads_mapped / 1000000) %>%
    full_join(fusion_mapped,
              by = "sample_name") %>%
    mutate(fusions_per_M = fusion_count / M_reads_mapped) %>%
    select(-M_reads_mapped)

fusion_data %>%
    gt::gt() %>%
    gt::fmt_number(columns = 2, decimals = 0, sep_mark = ",") %>%
    gt::fmt_number(columns = 4, decimals = 2) %>%
    gt::cols_label(sample_name = "Sample",
                   reads_mapped = "Reads Mapped",
                   fusion_count = "Fusions Detected",
                   fusions_per_M = md("Fusions per Million Reads")) %>%
    gt::cols_width(fusions_per_M ~ px(125),
                   fusion_count ~ px(100)) %>%
    gt::gtsave(file = "output/figures/fusionsMapped.png")

Testing for correlation between fusion counts and gene expression of up-regulated genes

de_results <-
    full_join(fusion_data, read_tsv("output/MmTumors_edgeR.txt", show_col_types = FALSE) %>%
        column_to_rownames("Gene") %>%
        filter(signif == "TRUE" & logFC > 0) %>%
        select(starts_with("A377")) %>%
        select(-`A377-A378T7`, -`A377-A378T8`) %>%
        t() %>%
        as.data.frame() %>%
        rownames_to_column("sample_name"))


ggplot(de_results, aes(x = fusions_per_M, y = Tspan33)) +
    geom_smooth(method = "lm", se = FALSE) +
    geom_point() +
    ggtitle("Tspan33")

cor.test(de_results$fusions_per_M, de_results$Wisp1)

cor_testing <- tibble(p_val = lapply(colnames(de_results)[5:ncol(de_results)],
                      function(x) cor.test(de_results$fusions_per_M,
                                           de_results[[x]])$p.value) %>%
                        as.numeric(),
                      pearson_cor = lapply(colnames(de_results)[5:ncol(de_results)],
                      function(x) cor.test(de_results$fusions_per_M,
                                           de_results[[x]])$estimate) %>%
                        as.numeric(),
                      gene = colnames(de_results)[5:ncol(de_results)]) %>%
    arrange(p_val) %>%
    mutate(fdr = qvalue::qvalue(p_val)$qvalues)



names(test) <- colnames(de_results)[5:ncol(de_results)]

AGDEX analysis

Keep this chunk last in the analysis since it borks my namespace >:(

Compare: - VGLL2NCOA2 human and fish tumors vs. mature skeletal muscle - VGLL2NCOA2 human and fish tumors vs. KRAS driven RMS - Mouse VGLL2NCOA2 and fish tumors vs. mature skeletal muscle
# Need data frame with rownames equal to the column names for the gene expression data
# and columns showing the groups
library(Biobase)
library(AGDEX)
cmp_table <- tibble(group_1 = c("DrFusion_DrSkMu",
                                "DrFusion_DrSkMu",
                                "DrFusion_DrSkMu",
                                "HsFusion_HsSkMu",
                                "HsFusion_HsSkMu"),
                    group_2 = c("HsFusion_HsSkMu",
                                "MmFusion_MmSkMu",
                                "DrKras_DrKrasCtrl",
                                "DrKras_DrKrasCtrl",
                                "MmFusion_MmSkMu"))

species_full <- list(Hs = "Homo sapiens",
                     Dr = "Danio rerio",
                     Mm = "Mus musculus")

for (i in seq_len(nrow(cmp_table))) {
    comparison_1 <- cmp_table$group_1[i]
    comparison_2 <- cmp_table$group_2[i]
    species_1 <- species_full[[substr(comparison_1, 1, 2)]]
    species_2 <- species_full[[substr(comparison_2, 1, 2)]]

    if (species_1 != species_2) {
        # get homologs
        orthologs <- read_tsv("../refs/ORTHOLOGY-ALLIANCE_COMBINED.tsv.gz",
                        show_col_types = FALSE,
                        comment = "#") %>%
            dplyr::filter(Gene1SpeciesName == species_1 &
                        Gene2SpeciesName == species_2) %>%
            dplyr::select(group_1_gene = Gene1Symbol,
                        group_2_gene = Gene2Symbol) %>%
            distinct()
    } else {
        orthologs <- read_tsv("../refs/ORTHOLOGY-ALLIANCE_COMBINED.tsv.gz",
                        show_col_types = FALSE,
                        comment = "#") %>%
            dplyr::filter(Gene1SpeciesName == species_1) %>%
            dplyr::select(group_1_gene = Gene1Symbol,
                        group_2_gene = Gene1Symbol) %>%
            distinct()
    }

    # Get normalized data
    data_1 <- read_tsv(paste0("output/de/",
                              cmp_table$group_1[i],
                              "_edgeR.txt"),
                       show_col_types = FALSE) %>%
        dplyr::select(-matches("_mean"),
               -logFC,
               -logCPM,
               -PValue,
               -FDR,
               -signif) %>%
        dplyr::filter(Gene %in% orthologs$group_1_gene) %>%
        column_to_rownames("Gene")

    data_2 <- read_tsv(paste0("output/de/",
                              cmp_table$group_2[i],
                              "_edgeR.txt"),
                       show_col_types = FALSE) %>%
        dplyr::select(-matches("_mean"),
               -logFC,
               -logCPM,
               -PValue,
               -FDR,
               -signif) %>%
        dplyr::filter(Gene %in% orthologs$group_2_gene) %>%
        column_to_rownames("Gene")

    group_1_pheno = data.frame(row.names = colnames(data_1),
                            grp = readRDS(paste0("output/de/",
                                            comparison_1,
                                            "_groups.rds")))

    group_2_pheno = data.frame(row.names = colnames(data_2),
                            grp = readRDS(paste0("output/de/",
                                            comparison_2,
                                            "_groups.rds")))

    # Do de testing on each dataset
    group_1_dex_set <-
        make.dex.set.object(
            Eset.data = ExpressionSet(assayData = as.matrix(data_1),
                                      phenoData = AnnotatedDataFrame(group_1_pheno)),
            comp.var = "grp",
            comp.def = str_replace(comparison_1, "_", "-"),
            gset.collection = NULL)

    group_2_dex_set <-
        make.dex.set.object(
            Eset.data = ExpressionSet(assayData = as.matrix(data_2),
                                      phenoData = AnnotatedDataFrame(group_2_pheno)),
            comp.var = "grp",
            comp.def = str_replace(comparison_2, "_", "-"),
            gset.collection = NULL)


    # Make up list of map.data

    map_data <- list(probe.map = orthologs,
                     map.Aprobe.col = 1,
                     map.Bprobe.col = 2)

    # Run agdex
    agdex_res <- AGDEX::agdex(dex.setA = group_1_dex_set,
                              dex.setB = group_2_dex_set,
                              map.data = map_data,
                              min.nperms = 5000,
                              max.nperms = 10000)

    write_tsv(agdex_res$gwide.agdex.result,
                file = paste0("output/agdex/",
                              comparison_1,
                              "_vs_",
                              comparison_2,
                              ".txt"))

    # agdex_res$meta.dex.res %>%
    #     mutate(signif = paste(dpvalA <= 0.05, dpvalB <= 0.05),
    #            direction = paste(dstatA > 0, dstatB > 0)) %>%
    #     dplyr::filter(signif != "FALSE FALSE") %>%
    #     ggplot(aes(x = dstatA, y = dstatB, color = direction)) +
    #     geom_point(alpha = 0.3, size = 1) +
    #     #scale_x_log10() +
    #     #scale_y_log10() +
    #     facet_wrap(~direction) +
    #     xlim(-1000, 1000) +
    #     ylim(-1000, 1000) +
    #     labs(x = comparison_1,
    #          y = comparison_2)

    png(paste0("output/agdex/",
                  comparison_1,
                  "_vs_",
                  comparison_2,
                  ".png"),
           width = 3000,
           height = 3000,
           res = 300)
    agdex.scatterplot(agdex_res)
    dev.off()

    merged_results <-
        read_tsv(paste0("output/de/",
                    cmp_table$group_1[i],
                    "_edgeR.txt"),
             show_col_types = FALSE) %>%
        dplyr::select(Gene,
                      logFC,
                      signif) %>%
        dplyr::filter(Gene %in% orthologs$group_1_gene) %>%
        dplyr::rename(group_1_gene = Gene,
                      g1_logfc = logFC,
                      g1_signif = signif) %>%
        left_join(orthologs, by = "group_1_gene") %>%
        inner_join(read_tsv(paste0("output/de/",
                                   cmp_table$group_2[i],
                                   "_edgeR.txt"),
                            show_col_types = FALSE) %>%
            dplyr::select(Gene,
                          matches("_mean"),
                          logFC,
                          signif) %>%
            dplyr::filter(Gene %in% orthologs$group_2_gene) %>%
            dplyr::rename(group_2_gene = Gene,
                          g2_logfc = logFC,
                          g2_signif = signif) %>%
            left_join(orthologs)) %>%
        mutate(signif = paste(g1_signif, g2_signif),
               direction = if_else(g1_logfc > 0 & g2_logfc > 0,
                                   "up",
                                   if_else(g1_logfc < 0 & g2_logfc < 0,
                                            "down",
                                            "discordant"))) %>%
        group_by(direction) %>%
        mutate(direction = str_c(direction, " (", n(), ")"))

    lm_test <- lm(g2_logfc ~ g1_logfc, data = merged_results) %>%
        summary()

    cor_test <- cor.test(merged_results$g1_logfc,
                         merged_results$g2_logfc,
                         method = "pearson")

    # Calculate the proportion of genes are in the upper right and lower left of the plot
    prop_q2_q3 <- sum((merged_results$g1_logfc > 0 &
                        merged_results$g2_logfc > 0) |
                   (merged_results$g1_logfc < 0 &
                        merged_results$g2_logfc < 0)) / nrow(merged_results)

    # Permute over random samples from kernal density function
    g1_density <- density(merged_results$g1_logfc, n = 2000)
    g2_density <- density(merged_results$g2_logfc, n = 2000)
    n_perm <- 100000
    n_perm_higher <- 0
    for (i in seq_len(n_perm)) {
        g1_perm <- sample(g1_density$x,
                        nrow(merged_results),
                        prob = g1_density$y,
                        replace = TRUE) +
            rnorm(nrow(merged_results), 0, g1_density$bw)

        g2_perm <- sample(g2_density$x,
                        nrow(merged_results),
                        prob = g2_density$y,
                        replace = TRUE) +
            rnorm(nrow(merged_results), 0, g1_density$bw)

        perm_prop <- sum((g1_perm > 0 & g2_perm > 0) |
                            (g1_perm < 0 & g2_perm < 0)) /
                            nrow(merged_results)

        if (perm_prop >= prop_q2_q3) {
            n_perm_higher <- n_perm_higher + 1
        }

        if (i %% 1000 == 0) {
            message("permutation: ", i)
        }
    }

    perm_p <- n_perm_higher/ n_perm
    if (perm_p == 0) {
        perm_p <- paste0("<", 1/n_perm)
    }

    ggplot(merged_results, aes(x = g1_logfc,
                               y = g2_logfc,
                               color = direction)) +
        geom_point(alpha = 0.4, size = 1) +
        labs(x = comparison_1,
             y = comparison_2,
             title = paste0(comparison_1,
                            " vs ",
                            comparison_2,
                            "\nSlope:",
                            round(lm_test$coefficients[2], digits =2),
                            ", R-squared:",
                            round(lm_test$r.squared, digits = 2),
                            "\nPearson: ",
                            round(cor_test$estimate, digits = 2),
                            ", p-value: ",
                            sprintf("%.2e", cor_test$p.value),
                            "\nPerm p-value: ",
                            perm_p)) +
        scale_color_manual(values = c("gray",
                                      "#2D2DE5",
                                      "#D32821"),
                           name = "Direction") +
        geom_smooth(method = "lm",
                    color = "black",
                    se = FALSE)

    ggsave(paste0("output/agdex/",
                  comparison_1,
                  "_vs_",
                  comparison_2,
                  "_raw_data.png"),
           height = 8,
           width = 10)
}
## Preparing differential expression data sets (dex.setA and dex.setB): Wed Jun  8 16:44:31 2022 
## Preparing data that maps probe set IDs across experiments: Wed Jun  8 16:44:32 2022 
## Computing statistics for observed data: Wed Jun  8 16:44:32 2022 
## Permuting experiment A data 10000 times: Wed Jun  8 16:44:32 2022 
## Computing time for permutation analysis of data set A (in seconds): 
## 10.928 0.007 10.963 0 0 
## Permuting experiment B data 10000 times: Wed Jun  8 16:44:43 2022 
## Computing time for permutation analysis of data set B (in seconds): 
## 153.61 0.055 154.2 0 0 
## Done: Wed Jun  8 16:47:17 2022
## Joining, by = "group_2_gene"Joining, by = c("group_1_gene", "group_2_gene")permutation: 1000
## permutation: 2000
## permutation: 3000
## permutation: 4000
## permutation: 5000
## permutation: 6000
## permutation: 7000
## permutation: 8000
## permutation: 9000
## permutation: 10000
## permutation: 11000
## permutation: 12000
## permutation: 13000
## permutation: 14000
## permutation: 15000
## permutation: 16000
## permutation: 17000
## permutation: 18000
## permutation: 19000
## permutation: 20000
## permutation: 21000
## permutation: 22000
## permutation: 23000
## permutation: 24000
## permutation: 25000
## permutation: 26000
## permutation: 27000
## permutation: 28000
## permutation: 29000
## permutation: 30000
## permutation: 31000
## permutation: 32000
## permutation: 33000
## permutation: 34000
## permutation: 35000
## permutation: 36000
## permutation: 37000
## permutation: 38000
## permutation: 39000
## permutation: 40000
## permutation: 41000
## permutation: 42000
## permutation: 43000
## permutation: 44000
## permutation: 45000
## permutation: 46000
## permutation: 47000
## permutation: 48000
## permutation: 49000
## permutation: 50000
## permutation: 51000
## permutation: 52000
## permutation: 53000
## permutation: 54000
## permutation: 55000
## permutation: 56000
## permutation: 57000
## permutation: 58000
## permutation: 59000
## permutation: 60000
## permutation: 61000
## permutation: 62000
## permutation: 63000
## permutation: 64000
## permutation: 65000
## permutation: 66000
## permutation: 67000
## permutation: 68000
## permutation: 69000
## permutation: 70000
## permutation: 71000
## permutation: 72000
## permutation: 73000
## permutation: 74000
## permutation: 75000
## permutation: 76000
## permutation: 77000
## permutation: 78000
## permutation: 79000
## permutation: 80000
## permutation: 81000
## permutation: 82000
## permutation: 83000
## permutation: 84000
## permutation: 85000
## permutation: 86000
## permutation: 87000
## permutation: 88000
## permutation: 89000
## permutation: 90000
## permutation: 91000
## permutation: 92000
## permutation: 93000
## permutation: 94000
## permutation: 95000
## permutation: 96000
## permutation: 97000
## permutation: 98000
## permutation: 99000
## permutation: 100000
## `geom_smooth()` using formula 'y ~ x'
## Preparing differential expression data sets (dex.setA and dex.setB): Wed Jun  8 16:53:40 2022 
## Preparing data that maps probe set IDs across experiments: Wed Jun  8 16:53:41 2022 
## Computing statistics for observed data: Wed Jun  8 16:53:41 2022 
## Permuting experiment A data 10000 times: Wed Jun  8 16:53:41 2022 
## Computing time for permutation analysis of data set A (in seconds): 
## 9.595 0.009 9.627 0 0 
## Permuting experiment B data 10000 times: Wed Jun  8 16:53:50 2022 
## Computing time for permutation analysis of data set B (in seconds): 
## 9.613 0.001 9.639 0 0 
## Done: Wed Jun  8 16:54:00 2022
## Joining, by = "group_2_gene"Joining, by = c("group_1_gene", "group_2_gene")permutation: 1000
## permutation: 2000
## permutation: 3000
## permutation: 4000
## permutation: 5000
## permutation: 6000
## permutation: 7000
## permutation: 8000
## permutation: 9000
## permutation: 10000
## permutation: 11000
## permutation: 12000
## permutation: 13000
## permutation: 14000
## permutation: 15000
## permutation: 16000
## permutation: 17000
## permutation: 18000
## permutation: 19000
## permutation: 20000
## permutation: 21000
## permutation: 22000
## permutation: 23000
## permutation: 24000
## permutation: 25000
## permutation: 26000
## permutation: 27000
## permutation: 28000
## permutation: 29000
## permutation: 30000
## permutation: 31000
## permutation: 32000
## permutation: 33000
## permutation: 34000
## permutation: 35000
## permutation: 36000
## permutation: 37000
## permutation: 38000
## permutation: 39000
## permutation: 40000
## permutation: 41000
## permutation: 42000
## permutation: 43000
## permutation: 44000
## permutation: 45000
## permutation: 46000
## permutation: 47000
## permutation: 48000
## permutation: 49000
## permutation: 50000
## permutation: 51000
## permutation: 52000
## permutation: 53000
## permutation: 54000
## permutation: 55000
## permutation: 56000
## permutation: 57000
## permutation: 58000
## permutation: 59000
## permutation: 60000
## permutation: 61000
## permutation: 62000
## permutation: 63000
## permutation: 64000
## permutation: 65000
## permutation: 66000
## permutation: 67000
## permutation: 68000
## permutation: 69000
## permutation: 70000
## permutation: 71000
## permutation: 72000
## permutation: 73000
## permutation: 74000
## permutation: 75000
## permutation: 76000
## permutation: 77000
## permutation: 78000
## permutation: 79000
## permutation: 80000
## permutation: 81000
## permutation: 82000
## permutation: 83000
## permutation: 84000
## permutation: 85000
## permutation: 86000
## permutation: 87000
## permutation: 88000
## permutation: 89000
## permutation: 90000
## permutation: 91000
## permutation: 92000
## permutation: 93000
## permutation: 94000
## permutation: 95000
## permutation: 96000
## permutation: 97000
## permutation: 98000
## permutation: 99000
## permutation: 100000
## `geom_smooth()` using formula 'y ~ x'
## Preparing differential expression data sets (dex.setA and dex.setB): Wed Jun  8 16:59:27 2022 
## Preparing data that maps probe set IDs across experiments: Wed Jun  8 16:59:28 2022 
## Computing statistics for observed data: Wed Jun  8 16:59:28 2022 
## Permuting experiment A data 10000 times: Wed Jun  8 16:59:28 2022 
## Computing time for permutation analysis of data set A (in seconds): 
## 10.126 0.001 10.157 0 0 
## Permuting experiment B data 20 times: Wed Jun  8 16:59:39 2022 
## Computing time for permutation analysis of data set B (in seconds): 
## 0.014 0 0.014 0 0 
## Done: Wed Jun  8 16:59:39 2022
## Joining, by = "group_2_gene"Joining, by = c("group_1_gene", "group_2_gene")permutation: 1000
## permutation: 2000
## permutation: 3000
## permutation: 4000
## permutation: 5000
## permutation: 6000
## permutation: 7000
## permutation: 8000
## permutation: 9000
## permutation: 10000
## permutation: 11000
## permutation: 12000
## permutation: 13000
## permutation: 14000
## permutation: 15000
## permutation: 16000
## permutation: 17000
## permutation: 18000
## permutation: 19000
## permutation: 20000
## permutation: 21000
## permutation: 22000
## permutation: 23000
## permutation: 24000
## permutation: 25000
## permutation: 26000
## permutation: 27000
## permutation: 28000
## permutation: 29000
## permutation: 30000
## permutation: 31000
## permutation: 32000
## permutation: 33000
## permutation: 34000
## permutation: 35000
## permutation: 36000
## permutation: 37000
## permutation: 38000
## permutation: 39000
## permutation: 40000
## permutation: 41000
## permutation: 42000
## permutation: 43000
## permutation: 44000
## permutation: 45000
## permutation: 46000
## permutation: 47000
## permutation: 48000
## permutation: 49000
## permutation: 50000
## permutation: 51000
## permutation: 52000
## permutation: 53000
## permutation: 54000
## permutation: 55000
## permutation: 56000
## permutation: 57000
## permutation: 58000
## permutation: 59000
## permutation: 60000
## permutation: 61000
## permutation: 62000
## permutation: 63000
## permutation: 64000
## permutation: 65000
## permutation: 66000
## permutation: 67000
## permutation: 68000
## permutation: 69000
## permutation: 70000
## permutation: 71000
## permutation: 72000
## permutation: 73000
## permutation: 74000
## permutation: 75000
## permutation: 76000
## permutation: 77000
## permutation: 78000
## permutation: 79000
## permutation: 80000
## permutation: 81000
## permutation: 82000
## permutation: 83000
## permutation: 84000
## permutation: 85000
## permutation: 86000
## permutation: 87000
## permutation: 88000
## permutation: 89000
## permutation: 90000
## permutation: 91000
## permutation: 92000
## permutation: 93000
## permutation: 94000
## permutation: 95000
## permutation: 96000
## permutation: 97000
## permutation: 98000
## permutation: 99000
## permutation: 100000
## `geom_smooth()` using formula 'y ~ x'
## Preparing differential expression data sets (dex.setA and dex.setB): Wed Jun  8 17:05:19 2022 
## Preparing data that maps probe set IDs across experiments: Wed Jun  8 17:05:20 2022 
## Computing statistics for observed data: Wed Jun  8 17:05:20 2022 
## Permuting experiment A data 10000 times: Wed Jun  8 17:05:21 2022 
## Computing time for permutation analysis of data set A (in seconds): 
## 152.716 0.053 153.221 0 0 
## Permuting experiment B data 20 times: Wed Jun  8 17:07:54 2022 
## Computing time for permutation analysis of data set B (in seconds): 
## 0.014 0 0.014 0 0 
## Done: Wed Jun  8 17:07:54 2022
## Joining, by = "group_2_gene"Joining, by = c("group_1_gene", "group_2_gene")permutation: 1000
## permutation: 2000
## permutation: 3000
## permutation: 4000
## permutation: 5000
## permutation: 6000
## permutation: 7000
## permutation: 8000
## permutation: 9000
## permutation: 10000
## permutation: 11000
## permutation: 12000
## permutation: 13000
## permutation: 14000
## permutation: 15000
## permutation: 16000
## permutation: 17000
## permutation: 18000
## permutation: 19000
## permutation: 20000
## permutation: 21000
## permutation: 22000
## permutation: 23000
## permutation: 24000
## permutation: 25000
## permutation: 26000
## permutation: 27000
## permutation: 28000
## permutation: 29000
## permutation: 30000
## permutation: 31000
## permutation: 32000
## permutation: 33000
## permutation: 34000
## permutation: 35000
## permutation: 36000
## permutation: 37000
## permutation: 38000
## permutation: 39000
## permutation: 40000
## permutation: 41000
## permutation: 42000
## permutation: 43000
## permutation: 44000
## permutation: 45000
## permutation: 46000
## permutation: 47000
## permutation: 48000
## permutation: 49000
## permutation: 50000
## permutation: 51000
## permutation: 52000
## permutation: 53000
## permutation: 54000
## permutation: 55000
## permutation: 56000
## permutation: 57000
## permutation: 58000
## permutation: 59000
## permutation: 60000
## permutation: 61000
## permutation: 62000
## permutation: 63000
## permutation: 64000
## permutation: 65000
## permutation: 66000
## permutation: 67000
## permutation: 68000
## permutation: 69000
## permutation: 70000
## permutation: 71000
## permutation: 72000
## permutation: 73000
## permutation: 74000
## permutation: 75000
## permutation: 76000
## permutation: 77000
## permutation: 78000
## permutation: 79000
## permutation: 80000
## permutation: 81000
## permutation: 82000
## permutation: 83000
## permutation: 84000
## permutation: 85000
## permutation: 86000
## permutation: 87000
## permutation: 88000
## permutation: 89000
## permutation: 90000
## permutation: 91000
## permutation: 92000
## permutation: 93000
## permutation: 94000
## permutation: 95000
## permutation: 96000
## permutation: 97000
## permutation: 98000
## permutation: 99000
## permutation: 100000
## `geom_smooth()` using formula 'y ~ x'
## Preparing differential expression data sets (dex.setA and dex.setB): Wed Jun  8 17:14:06 2022 
## Preparing data that maps probe set IDs across experiments: Wed Jun  8 17:14:07 2022 
## Computing statistics for observed data: Wed Jun  8 17:14:08 2022 
## Permuting experiment A data 10000 times: Wed Jun  8 17:14:08 2022 
## Computing time for permutation analysis of data set A (in seconds): 
## 178.813 0.055 179.528 0 0 
## Permuting experiment B data 10000 times: Wed Jun  8 17:17:07 2022 
## Computing time for permutation analysis of data set B (in seconds): 
## 10.76 0.009 10.801 0 0 
## Done: Wed Jun  8 17:17:18 2022
## Joining, by = "group_2_gene"Joining, by = c("group_1_gene", "group_2_gene")permutation: 1000
## permutation: 2000
## permutation: 3000
## permutation: 4000
## permutation: 5000
## permutation: 6000
## permutation: 7000
## permutation: 8000
## permutation: 9000
## permutation: 10000
## permutation: 11000
## permutation: 12000
## permutation: 13000
## permutation: 14000
## permutation: 15000
## permutation: 16000
## permutation: 17000
## permutation: 18000
## permutation: 19000
## permutation: 20000
## permutation: 21000
## permutation: 22000
## permutation: 23000
## permutation: 24000
## permutation: 25000
## permutation: 26000
## permutation: 27000
## permutation: 28000
## permutation: 29000
## permutation: 30000
## permutation: 31000
## permutation: 32000
## permutation: 33000
## permutation: 34000
## permutation: 35000
## permutation: 36000
## permutation: 37000
## permutation: 38000
## permutation: 39000
## permutation: 40000
## permutation: 41000
## permutation: 42000
## permutation: 43000
## permutation: 44000
## permutation: 45000
## permutation: 46000
## permutation: 47000
## permutation: 48000
## permutation: 49000
## permutation: 50000
## permutation: 51000
## permutation: 52000
## permutation: 53000
## permutation: 54000
## permutation: 55000
## permutation: 56000
## permutation: 57000
## permutation: 58000
## permutation: 59000
## permutation: 60000
## permutation: 61000
## permutation: 62000
## permutation: 63000
## permutation: 64000
## permutation: 65000
## permutation: 66000
## permutation: 67000
## permutation: 68000
## permutation: 69000
## permutation: 70000
## permutation: 71000
## permutation: 72000
## permutation: 73000
## permutation: 74000
## permutation: 75000
## permutation: 76000
## permutation: 77000
## permutation: 78000
## permutation: 79000
## permutation: 80000
## permutation: 81000
## permutation: 82000
## permutation: 83000
## permutation: 84000
## permutation: 85000
## permutation: 86000
## permutation: 87000
## permutation: 88000
## permutation: 89000
## permutation: 90000
## permutation: 91000
## permutation: 92000
## permutation: 93000
## permutation: 94000
## permutation: 95000
## permutation: 96000
## permutation: 97000
## permutation: 98000
## permutation: 99000
## permutation: 100000
## `geom_smooth()` using formula 'y ~ x'